Reducing Long Tail Latencies in Geo-Distributed Systems

Author

  • KIRILL L. BOGDANOV
Abstract

Computing services are deeply integrated into modern society. Millions of people rely on these services daily for communication, coordination, trading, and accessing information. To meet high demand, many popular services are implemented and deployed as geo-distributed applications on top of third-party virtualized cloud providers. However, such deployments exhibit variable performance characteristics. To deliver high quality of service, these systems strive to adapt to ever-changing conditions by monitoring changes in state and making run-time decisions, such as choosing server peering, replica placement, and quorum selection. In this thesis, we seek to improve the quality of run-time decisions made by geo-distributed systems. We attempt to achieve this through: (1) a better understanding of the underlying deployment conditions, (2) systematic and thorough testing of the decision logic implemented in these systems, and (3) a clear view into the network and system states that allows these services to make better-informed decisions. We performed a long-term cross-datacenter latency measurement of the Amazon EC2 cloud provider. We used this data to quantify the variability of network conditions and demonstrated its impact on the performance of systems deployed on top of this cloud provider. Next, we validated the decision logic used in popular storage systems by examining their replica selection algorithms. We introduce GeoPerf, a tool that uses symbolic execution and lightweight modeling to perform systematic testing of replica selection algorithms. We applied GeoPerf to two popular storage systems and found one bug in each. Then, using traceroute and one-way delay measurements across EC2, we demonstrated a persistent correlation between network paths and network latency. We introduce EdgeVar, a tool that decouples routing-based and congestion-based changes in network latency.
By providing this additional information, we improved the quality of latency estimation and increased the stability of network path selection. Finally, we introduce Tectonic, a tool that tracks an application’s requests and responses at both the user and kernel levels. In combination with EdgeVar, it provides a complete view of the delays associated with each processing stage of a request and response. Using Tectonic, we analyzed the impact of sharing CPUs in a virtualized environment and were able to infer the hypervisor’s scheduling policies. We argue for the importance of knowing these policies and propose using them in applications’ decision-making processes.

Similar resources

Trash Day: Coordinating Garbage Collection in Distributed Systems

Cloud systems such as Hadoop, Spark and Zookeeper are frequently written in Java or other garbage-collected languages. However, GC-induced pauses can have a significant impact on these workloads. Specifically, GC pauses can reduce throughput for batch workloads, and cause high tail-latencies for interactive applications. In this paper, we show that distributed applications suffer from each node...

Cost-Effective Geo-Distributed Storage for Low-Latency Web Services

We present two systems that we have developed to aid geo-distributed web services deployed in the cloud to serve their users with low latency. First, CosTLO [23] helps mitigate the high latency variance observed when accessing data stored in multi-tenant cloud storage services. To cost-effectively satisfy a web service’s tail latency goals, CosTLO uses measurement-driven inferences about replic...

Thesis proposal: Meeting tail latency SLOs in shared networked storage

Meeting tail latency Service Level Objectives (SLOs) in shared networked storage systems is an important and challenging problem in datacenters. Our work is motivated by three trends: First, companies like Google and Amazon are increasingly interested in long tails at the 99th and 99.9th percentile latencies [11, 12]. As technology improves, users are more accustomed to low latency and start to...

C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection

Achieving predictable performance is critical for many distributed applications, yet difficult to achieve due to many factors that skew the tail of the latency distribution even in well-provisioned systems. In this paper, we present the fundamental challenges involved in designing a replica selection scheme that is robust in the face of performance fluctuations across servers. We illustrate the...

CrossStitch: An Efficient Transaction Processing Framework for Geo-Distributed Systems

Current transaction systems for geo-distributed datastores either have high transaction processing latencies or are unable to support general transactions with dependent operations. In this paper, we introduce CrossStitch, an efficient transaction processing framework that reduces latency by restructuring each transaction into a chain of state transitions, where each state consists of a key ope...

Publication date: 2016